Robust, Lexicalized Native Language Identification
Abstract
Previous approaches to the task of native language identification (Koppel et al., 2005) have been limited to small, within-corpus evaluations. Because such evaluations are restrictive and unreliable, we apply cross-corpus evaluation to the task. We demonstrate the efficacy of lexical features, which had previously been avoided because of within-corpus topic confounds, and provide a detailed evaluation of various options, including a simple bias adaptation technique and a number of classifier algorithms. Using a new web corpus as a training set, we reach high classification accuracy on a 7-language task, and this performance is robust across two independent test sets. Although we show that even higher accuracy is possible using cross-validation, we present strong evidence calling into question the validity of cross-validation evaluation on the standard dataset.
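To make the evaluation setup concrete, the following is a minimal sketch of cross-corpus evaluation with lexical (word unigram) features. Everything in it is an assumption for illustration: the two toy corpora, the abstract labels "A" and "B", and the small add-one-smoothed Naive Bayes classifier stand in for the real corpora, languages, and classifiers, which the abstract does not specify.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpora: TRAIN and TEST come from different "sources",
# so topic words seen in training do not carry over to the test set.
TRAIN = [
    ("i am agree with this opinion", "A"),
    ("i am agree that the life is hard", "A"),
    ("he explained me the problem yesterday", "B"),
    ("she explained me how the job works", "B"),
]
TEST = [
    ("we am agree about the new rule", "A"),
    ("they explained me the new rule", "B"),
]

def tokenize(text):
    """Lexical features: lowercase whitespace-separated word unigrams."""
    return text.lower().split()

def train_nb(docs):
    """Collect per-label word counts, label counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest add-one-smoothed log score."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, docs):
    """Fraction of documents whose predicted label matches the gold label."""
    return sum(predict(model, text) == label for text, label in docs) / len(docs)

# Cross-corpus evaluation: fit on one corpus, score on a different one.
model = train_nb(TRAIN)
cross_corpus_acc = accuracy(model, TEST)
print(cross_corpus_acc)
```

The point of the design is that the model never sees text from the test source, so any accuracy it reaches cannot come from topic overlap within a single corpus.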
Related resources
Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on Feature Encoding and Ensemble Optimization
In this paper, we systematically explore lexicalized and non-lexicalized local syntactic features for the task of Native Language Identification (NLI). We investigate different types of feature representations in single- and cross-corpus settings, including two representations inspired by a variationist perspective on the choices made in the linguistic system. To combine the different models, we ...
Offline Language-free Writer Identification based on Speeded-up Robust Features
This article proposes offline language-free writer identification based on speeded-up robust features (SURF), which goes through training, enrollment, and identification stages. In all stages, an isotropic Box filter is first used to segment the handwritten text image into word regions (WRs). Then, the SURF descriptors (SUDs) of word region and the corresponding scales and orientations (SOs) are extr...
Data Driven Language Transfer Hypotheses
Language transfer, the preferential second language behavior caused by similarities to the speaker’s native language, requires considerable expertise to be detected by humans alone. Our goal in this work is to replace expert intervention by data-driven methods wherever possible. We define a computational methodology that produces a concise list of lexicalized syntactic patterns that are control...
Finding Target Language Correspondence for Lexicalized EBMT System
This paper presents a three-phase approach to find the correspondence in Target Language (TL) sentence for a fragment of Source Language (SL) sentence in a lexicalized EBMT system. To be practical, it exploits surface information as much as possible instead of using parsers. Experiments show that, although not so perfect, it is very robust and effective. The three phases are: First, align the s...
L1 Glossing and Lexical Inferencing: Evaluation of the Overarching Issue of L1 Lexicalization
This empirical study reports on a cross-linguistic analysis of the overarching issue of L1 lexicalization regarding two (non)-interventionist approaches to vocabulary teaching. Participants were seventy-four juniors at the Islamic Azad University, Roudehen Branch in Tehran. The investigation pursued (i) the impact of the provided (non)-interventionist treatments on both sets of (non)-lexicalize...